Compatibility Issue with Chinese Text in Document Parsing #3530
Draft: Coniferish wants to merge 35 commits into main from jj/zh_adaptation
### feat(unstructured/partition/docx.py): add language detection and apply to text type classification

- Added a `languages` attribute to the Document base class. Language-dependent behavior shows up in several methods across the document, so a shared language list with a default value is needed. The attribute also partially meets the requirements of domain-driven design.
- Added a `languages` option to `DocxPartitionerOptions` to specify the list of languages used for text type classification.
- Modified `_DocxPartitioner.detect_text_type()` to use the specified languages, or to detect the languages automatically when "auto" is specified.
- This lets the partitioner classify text elements more accurately based on language, improving overall partitioning quality.
- For HTML and MD (MD uses the HTML partition method), the `languages` field is now passed through the entire construction chain until it reaches the `is_possible_narrative_text` and `is_possible_title` functions. Both functions already supported language-specific judgments, but the `languages` parameter was not passed correctly, so the capability was never enabled. This update enables it.
- **BREAKING CHANGE**: The `DocxPartitionerOptions` constructor and some other partition functions now accept a new `languages` parameter, which changes their signatures. Because most parameters have default values, this is not entirely a breaking change; treat it as a warning. The docx and md test cases have been re-run and pass, and simple test cases for the new feature have been added to confirm it works.
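The effect of passing `languages` down to the title check can be illustrated with a minimal, self-contained sketch. It does not use the library's code: `guess_languages`, `resolve_languages`, `looks_like_title`, the set `NON_CAPITALIZABLE_LANGUAGES`, and the 20-character limit are all hypothetical names and values chosen for illustration, and the language guess is a toy stand-in for real detection.

```python
from typing import List, Optional

# Hypothetical: language codes whose scripts have no upper/lower case.
NON_CAPITALIZABLE_LANGUAGES = {"zho", "jpn", "kor"}


def guess_languages(text: str) -> List[str]:
    """Toy stand-in for language detection: CJK ideographs mean "zho", otherwise "eng"."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return ["zho"]
    return ["eng"]


def resolve_languages(text: str, languages: Optional[List[str]]) -> List[str]:
    """Use the caller's languages, or guess them when None or ["auto"] is given."""
    if languages is None or languages == ["auto"]:
        return guess_languages(text)
    return languages


def looks_like_title(
    text: str, languages: Optional[List[str]] = None, cap_threshold: float = 0.5
) -> bool:
    """Very rough title heuristic that skips the capitalization test for caseless scripts."""
    text = text.strip()
    if not text or text.endswith((".", "。")):
        return False
    langs = resolve_languages(text, languages)
    if langs and all(lang in NON_CAPITALIZABLE_LANGUAGES for lang in langs):
        # Capitalization carries no signal here; fall back to a length check.
        return len(text) <= 20
    words = [w for w in text.split() if w[0].isalpha()]
    if not words:
        return False
    capitalized = sum(w[0].isupper() for w in words)
    return capitalized / len(words) >= cap_threshold
```

With `["eng"]` forced, a short Chinese heading fails the capitalization test because its characters are alphabetic but never uppercase. With `["auto"]` it is treated as a caseless script and the length check applies instead.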
Ran the full test suite; the pass rate is essentially the same as on the main branch.
… `DocxPartitionerOptions` to collapse into keyword arguments (kwargs).
2. Change `capitalizable_languages` to `non_capitalizable_languages` in the function `is_possible_narrative_text`.
…ure/zh_adaptation # Conflicts: # CHANGELOG.md
This commit resolves an issue where the method 'is_possible_narrative_text' would incorrectly return 'True' for an empty list of languages. The corrected state should instead return 'False' for such situations.
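The empty-list case can be sketched as follows. This is only an illustration under assumptions: `narrative_text_sketch` is a hypothetical function, the sentence heuristic is a placeholder, and the commit does not say how the original bug arose. One plausible route in Python is that `all()` over an empty iterable evaluates to `True`, so a language check built on `all(...)` silently passes when no languages are given.

```python
from typing import List


def narrative_text_sketch(text: str, languages: List[str]) -> bool:
    """Hypothetical narrative-text check with an explicit guard for an empty language list."""
    # Guard: with no languages there is nothing to base a language-specific
    # judgment on, so report False instead of falling through.
    if not languages:
        return False
    # Placeholder heuristic: several words and sentence-ending punctuation.
    words = text.split()
    return len(words) >= 3 and text.rstrip().endswith((".", "!", "?"))


# Why a guard is needed: all() is vacuously True for an empty list.
vacuous = all(lang == "eng" for lang in [])
```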
…ure/zh_adaptation
# Conflicts: # test_unstructured/documents/test_html.py # unstructured/documents/base.py # unstructured/documents/html.py # unstructured/documents/xml.py # unstructured/partition/epub.py # unstructured/partition/html.py
…ure/zh_adaptation
…e code formatting
- Update CHANGELOG.md to include compatibility issue fix for Chinese text in document parsing.
- Reformat import statements in test_odt.py for better readability.
- Adjust import order in html.py to adhere to PEP8 guidelines.
- Add `languages` parameter to text processing functions in pdf.py and text.py for improved language handling.
- Reformat long lines to improve code readability and maintain consistency.
Co-authored-by: Your Name <[email protected]>
# Conflicts: # unstructured/documents/html.py # unstructured/partition/html.py
…cOS, gsed (installed via brew) replaces the default sed. The script includes platform checks to use gsed on MacOS and sed on Linux. Additionally, awk is used for version extraction. Preliminary tests indicate the script works correctly on both Linux and MacOS.
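The platform check and version extraction described above live in a shell script. The same logic can be mirrored in Python as a rough sketch. The function names `sed_command` and `extract_version` are hypothetical, and the assumed version line format (`__version__ = "x.y.z"`) is a guess about the file being parsed, not something the commit message states.

```python
import platform
import re
from typing import Optional


def sed_command(system: str = "") -> str:
    """Return "gsed" on macOS (where platform.system() is "Darwin"), else "sed"."""
    system = system or platform.system()
    return "gsed" if system == "Darwin" else "sed"


def extract_version(line: str) -> Optional[str]:
    """Pull the quoted version out of a line such as: __version__ = "1.2.3"."""
    match = re.search(r"__version__\s*=\s*[\"']([^\"']+)[\"']", line)
    return match.group(1) if match else None
```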
…ure/zh_adaptation
Added logic in the `test_weaviate_schema_is_valid` test function to check the existing Weaviate schema. If the class to be created already exists, the creation step is skipped and a corresponding message is printed to avoid creating a duplicate class.
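The skip-if-exists check can be sketched without a running Weaviate instance. The helper below is hypothetical and takes the schema as a plain dictionary; the `{"classes": [{"class": ...}]}` shape is an assumption about what the client returns for the existing schema, and `create_class` stands in for whatever call actually creates the class.

```python
from typing import Any, Callable, Dict


def create_class_if_missing(
    schema: Dict[str, Any],
    class_def: Dict[str, Any],
    create_class: Callable[[Dict[str, Any]], None],
) -> bool:
    """Create class_def unless a class with the same name is already in schema.

    Returns True when the class was created and False when creation was skipped.
    """
    existing = {c.get("class") for c in schema.get("classes", [])}
    name = class_def["class"]
    if name in existing:
        print(f"Class {name!r} already exists; skipping creation.")
        return False
    create_class(class_def)
    return True
```

Passing the creation call in as a parameter keeps the check testable on its own: a test can hand in a list's `append` method and inspect what would have been created.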
# Conflicts: # CHANGELOG.md # examples/pgvector/pgvector.ipynb # examples/training/0-Core Concepts.ipynb # examples/training/1-Intro to Bricks.ipynb # examples/training/2-File Exploration.ipynb # examples/weaviate/weaviate.ipynb # test_unstructured/partition/test_auto.py # unstructured/documents/html.py
…zh_adaptation # Conflicts: # CHANGELOG.md # unstructured/__version__.py
# Conflicts: # CHANGELOG.md
Duplicate of #3267 since forked PRs are failing to pass chipper CI tests